Mann–Whitney U Test (Wilcoxon Rank-Sum) — From Scratch with NumPy#
The Mann–Whitney U test is a nonparametric test for comparing two independent samples. Instead of comparing means (like a two-sample t-test), it compares ranks — which makes it useful when data is skewed, heavy-tailed, or ordinal.
What you’ll learn#
when to use Mann–Whitney U (and when not to)
the hypotheses it tests (what “significant” actually means)
how \(U\) is computed (wins + ranks)
a low-level implementation in NumPy (including ties)
how to visualize the null distribution via permutations
how to interpret results + effect sizes
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from math import erf, sqrt
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
1) When to use it#
Use Mann–Whitney U when you have:
two independent groups (different people, different devices, etc.)
a numeric or at least ordinal outcome
a reason to avoid a normality assumption (skew, heavy tails, small samples)
Common situations:
A/B testing for skewed metrics (latency, time-on-task)
small samples where “normal enough” is questionable
ordinal ratings (Likert) treated as ordered
Not appropriate when:
observations are paired / repeated measures (use the Wilcoxon signed-rank test instead)
you have more than two groups (use Kruskal–Wallis, then post-hoc tests)
data are not independent (clustered/longitudinal without modeling that structure)
2) What hypothesis does it test?#
Let \(X\) be a random draw from group A and \(Y\) from group B.
A convenient way to express the null hypothesis is:
\[
H_0:\quad P(X > Y) + \tfrac{1}{2}\,P(X = Y) = \tfrac{1}{2}
\]
So under \(H_0\), a value from A is equally likely to be larger or smaller than a value from B (counting ties as half).
Alternatives:
two-sided: the distributions differ in either direction
greater: A tends to have larger values than B
less: A tends to have smaller values than B
Important nuance:
Mann–Whitney detects differences in distributions.
If the two groups have similar shapes/spreads and mostly differ by a shift, a significant result is often interpreted as a median shift.
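Under \(H_0\), the quantity \(P(X > Y) + \tfrac{1}{2}P(X = Y)\) is exactly \(\tfrac{1}{2}\). A quick simulation sketch of this fact (the exponential distribution here is arbitrary, chosen only because it is skewed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples from the SAME skewed distribution, so H0 holds by construction.
x = rng.exponential(scale=1.0, size=2000)
y = rng.exponential(scale=1.0, size=2000)

# Estimate P(X > Y) + 0.5 * P(X = Y) over all (x, y) pairs via broadcasting.
p_sup = (x[:, None] > y[None, :]).mean() + 0.5 * (x[:, None] == y[None, :]).mean()
print(round(p_sup, 3))  # close to 0.5 when the two distributions are identical
```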
3) Example data (skewed metric)#
We’ll simulate an A/B test on a skewed metric (think “response time”). Group B will be slightly smaller (faster) while keeping a similar shape.
rng = np.random.default_rng(7)
n_a = 25
n_b = 28
group_a = rng.lognormal(mean=0.10, sigma=0.55, size=n_a)
group_b = rng.lognormal(mean=-0.30, sigma=0.55, size=n_b)
{
    "n": (n_a, n_b),
    "mean": (group_a.mean(), group_b.mean()),
    "median": (np.median(group_a), np.median(group_b)),
}
{'n': (25, 28),
'mean': (0.9872234195640509, 0.7597615779329113),
'median': (0.9504917274585137, 0.7089964448880488)}
fig = go.Figure()
fig.add_trace(
    go.Violin(
        y=group_a,
        name="A (control)",
        box_visible=True,
        meanline_visible=True,
        points="all",
        jitter=0.25,
        scalemode="width",
        marker=dict(size=6, opacity=0.7, color="#636EFA"),
        line=dict(color="#636EFA"),
    )
)
fig.add_trace(
    go.Violin(
        y=group_b,
        name="B (treatment)",
        box_visible=True,
        meanline_visible=True,
        points="all",
        jitter=0.25,
        scalemode="width",
        marker=dict(size=6, opacity=0.7, color="#EF553B"),
        line=dict(color="#EF553B"),
    )
)
fig.update_layout(
    title="Skewed metric in two independent groups",
    yaxis_title="Response time (arbitrary units)",
    width=760,
    height=450,
)
fig.show()
4) Core idea: replace values by ranks#
The test works on ranks:
Pool the two samples.
Sort them.
Assign ranks \(1,2,\ldots,n\).
Add up the ranks inside each group.
If there are ties, we assign the tied values their average rank.
def rankdata_average(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values)
    order = np.argsort(values, kind="mergesort")  # stable sort
    sorted_vals = values[order]
    ranks_sorted = np.empty_like(sorted_vals, dtype=float)
    # Each run of equal values spans ranks start+1 .. end; assign their mean.
    _, first, counts = np.unique(sorted_vals, return_index=True, return_counts=True)
    for start, count in zip(first, counts):
        end = start + count
        avg_rank = 0.5 * ((start + 1) + end)
        ranks_sorted[start:end] = avg_rank
    # Scatter the ranks back into the original order.
    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted
    return ranks
demo = np.array([10, 10, 20, 15])
rankdata_average(demo)
array([1.5, 1.5, 4. , 3. ])
5) From rank sums to \(U\) (the “wins” statistic)#
Let:
\(n_1\) = size of group A
\(n_2\) = size of group B
\(R_1\) = sum of ranks for group A (after pooling + ranking)
Then:
\[
U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2}
\]
\(U_1\) can be interpreted as:
the number of (A,B) pairs where \(A > B\)
plus half the number of ties
And:
\[
U_2 = n_1 n_2 - U_1
\]
For a two-sided test, many references report \(U = \min(U_1, U_2)\).
def u_from_pairs(x, y):
    """Compute U1 directly from its definition: wins plus half the ties."""
    x = np.asarray(x)
    y = np.asarray(y)
    gt = (x[:, None] > y[None, :]).sum()  # pairs where A beats B
    eq = (x[:, None] == y[None, :]).sum()  # tied pairs count as half a win
    return gt + 0.5 * eq
def mann_whitney_u_from_ranks(x, y):
    """Compute (U1, U2) from pooled ranks; also return the pooled ranks."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or y.size == 0:
        raise ValueError("Both samples must be non-empty")
    n1 = x.size
    n2 = y.size
    pooled = np.concatenate([x, y])
    ranks = rankdata_average(pooled)
    R1 = ranks[:n1].sum()
    U1 = R1 - n1 * (n1 + 1) / 2
    U2 = n1 * n2 - U1  # U1 + U2 always equals n1 * n2
    return U1, U2, ranks
x_small = np.array([3, 1, 5])
y_small = np.array([2, 4, 6])
U1_pair = u_from_pairs(x_small, y_small)
U1_rank, U2_rank, _ = mann_whitney_u_from_ranks(x_small, y_small)
(U1_pair, U1_rank, U2_rank)
(3.0, 3.0, 6.0)
U1, U2, ranks = mann_whitney_u_from_ranks(group_a, group_b)
U = min(U1, U2)
(U1, U2, U)
(476.0, 224.0, 224.0)
pooled = np.concatenate([group_a, group_b])
labels = np.array(["A"] * len(group_a) + ["B"] * len(group_b))
ranks = rankdata_average(pooled)
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=ranks[labels == "A"],
        y=pooled[labels == "A"],
        mode="markers",
        name="A",
        marker=dict(size=8, opacity=0.85, color="#636EFA"),
    )
)
fig.add_trace(
    go.Scatter(
        x=ranks[labels == "B"],
        y=pooled[labels == "B"],
        mode="markers",
        name="B",
        marker=dict(size=8, opacity=0.85, color="#EF553B"),
    )
)
fig.update_layout(
    title="Pooled ranks: lower values get lower ranks",
    xaxis_title="Rank (1 = smallest)",
    yaxis_title="Value",
    width=760,
    height=450,
)
fig.show()
6) The p-value: a null distribution for \(U\)#
Under \(H_0\), the two groups are exchangeable: the labels “A” and “B” shouldn’t matter.
So we can build a null distribution by:
pooling the observed values
randomly re-assigning \(n_1\) of them to “A” (the rest to “B”)
computing \(U_1\) each time
The p-value is the proportion of permutations with a statistic at least as extreme as what we observed.
def mw_u_permutation_test(x, y, alternative="two-sided", n_permutations=20_000, rng=None):
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or y.size == 0:
        raise ValueError("Both samples must be non-empty")
    if rng is None:
        rng = np.random.default_rng()
    n1 = x.size
    n2 = y.size
    n = n1 + n2
    pooled = np.concatenate([x, y])
    ranks = rankdata_average(pooled)
    obs_R1 = ranks[:n1].sum()
    obs_U1 = obs_R1 - n1 * (n1 + 1) / 2
    # Each row of random keys encodes one random relabeling: the positions of
    # the n1 smallest keys form a uniformly random subset assigned to "A".
    keys = rng.random((n_permutations, n))
    idx = np.argpartition(keys, kth=n1 - 1, axis=1)[:, :n1]
    perm_R1 = ranks[idx].sum(axis=1)
    perm_U1 = perm_R1 - n1 * (n1 + 1) / 2
    # The "+1" terms keep the estimated p-value nonzero and slightly conservative.
    if alternative == "greater":
        p = (np.sum(perm_U1 >= obs_U1) + 1) / (n_permutations + 1)
    elif alternative == "less":
        p = (np.sum(perm_U1 <= obs_U1) + 1) / (n_permutations + 1)
    elif alternative == "two-sided":
        lo = (np.sum(perm_U1 <= obs_U1) + 1) / (n_permutations + 1)
        hi = (np.sum(perm_U1 >= obs_U1) + 1) / (n_permutations + 1)
        p = min(1.0, 2 * min(lo, hi))
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return obs_U1, perm_U1, p
rng_perm = np.random.default_rng(7)
obs_U1, perm_U1, p_perm = mw_u_permutation_test(
    group_a,
    group_b,
    alternative="two-sided",
    n_permutations=20_000,
    rng=rng_perm,
)
{
    "U1": obs_U1,
    "U2": len(group_a) * len(group_b) - obs_U1,
    "p_value (permutation)": p_perm,
}
{'U1': 476.0, 'U2': 224.0, 'p_value (permutation)': 0.024098795060246987}
mu_null = len(group_a) * len(group_b) / 2
d = abs(obs_U1 - mu_null)
fig = go.Figure()
fig.add_trace(go.Histogram(x=perm_U1, nbinsx=40, name="permuted U1", marker=dict(color="rgba(99,110,250,0.65)")))
fig.add_vline(
    x=obs_U1,
    line_width=3,
    line_color="black",
    annotation_text=f"observed U1 = {obs_U1:.1f}",
    annotation_position="top",
)
fig.add_vline(
    x=mu_null,
    line_width=2,
    line_dash="dash",
    line_color="gray",
    annotation_text="null mean",
    annotation_position="top right",
)
fig.add_vrect(
    x0=min(perm_U1),
    x1=mu_null - d,
    fillcolor="rgba(239,85,59,0.18)",
    line_width=0,
)
fig.add_vrect(
    x0=mu_null + d,
    x1=max(perm_U1),
    fillcolor="rgba(239,85,59,0.18)",
    line_width=0,
)
fig.update_layout(
    title=f"Null distribution of U1 under H0 (permutations), p ≈ {p_perm:.4f}",
    xaxis_title="U1",
    yaxis_title="Count",
    width=760,
    height=450,
)
fig.show()
7) Normal approximation (large samples)#
For larger samples, \(U_1\) is well-approximated by a normal distribution.
Mean under \(H_0\):
\[
\mu_U = \frac{n_1 n_2}{2}
\]
Variance under \(H_0\) (no ties):
\[
\sigma_U^2 = \frac{n_1 n_2 (n + 1)}{12}
\]
With ties, we apply a standard tie correction to the variance. If \(t_j\) is the size of the \(j\)-th group of tied values:
\[
\sigma_U^2 = \frac{n_1 n_2}{12}\left((n + 1) - \frac{\sum_j (t_j^3 - t_j)}{n(n - 1)}\right)
\]
Then we compute a z-score (often with a continuity correction) and get a p-value from the normal CDF.
def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))
def mw_u_normal_approx(x, y, alternative="two-sided", continuity=True):
    if alternative not in ("two-sided", "greater", "less"):
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or y.size == 0:
        raise ValueError("Both samples must be non-empty")
    n1 = x.size
    n2 = y.size
    n = n1 + n2
    pooled = np.concatenate([x, y])
    ranks = rankdata_average(pooled)
    R1 = ranks[:n1].sum()
    U1 = R1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    # Tie correction: each group of t tied values contributes t^3 - t.
    _, counts = np.unique(pooled, return_counts=True)
    tie_term = np.sum(counts**3 - counts)
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    sigma = sqrt(var)
    if sigma == 0:  # every pooled value is identical
        return U1, 0.0, 1.0
    # Continuity correction: shift U1 half a unit toward the null mean.
    cc = 0.0
    if continuity:
        if alternative == "greater":
            cc = -0.5
        elif alternative == "less":
            cc = 0.5
        else:  # two-sided
            cc = -0.5 if U1 > mu else 0.5 if U1 < mu else 0.0
    z = (U1 - mu + cc) / sigma
    cdf = normal_cdf(z)
    if alternative == "greater":
        p = 1 - cdf
    elif alternative == "less":
        p = cdf
    else:
        p = 2 * min(cdf, 1 - cdf)
    return U1, z, p
U1_norm, z, p_norm = mw_u_normal_approx(group_a, group_b, alternative="two-sided", continuity=True)
{
    "U1": U1_norm,
    "z": z,
    "p_value (normal approx)": p_norm,
}
{'U1': 476.0,
'z': 2.2360857240006173,
'p_value (normal approx)': 0.025346156404938203}
8) Interpreting the result#
What a small p-value means:
If \(H_0\) were true (no tendency for A to be larger than B), then observing a \(U\) at least this extreme would be rare.
So a small p-value is evidence against \(H_0\).
What it does not mean:
It does not tell you the probability that \(H_0\) is true.
It does not tell you how large the difference is.
A simple, interpretable effect size is the probability of superiority:
\[
A = \frac{U_1}{n_1 n_2}
\]
\(A = 0.5\) suggests no tendency for A to be larger than B.
\(A = 0.65\) means a random A observation exceeds a random B observation about 65% of the time (counting ties as half).
n1 = len(group_a)
n2 = len(group_b)
A = obs_U1 / (n1 * n2)
cliffs_delta = 2 * A - 1
r_effect = abs(z) / sqrt(n1 + n2)
{
    "probability_of_superiority (A)": A,
    "Cliff's delta (2A-1)": cliffs_delta,
    "r (|z|/sqrt(n))": r_effect,
}
{'probability_of_superiority (A)': 0.68,
"Cliff's delta (2A-1)": 0.3600000000000001,
'r (|z|/sqrt(n))': 0.3071499960863374}
fig = go.Figure()
fig.add_trace(go.Bar(x=[A], y=["A"], orientation="h", marker=dict(color="#00CC96")))
fig.add_vline(x=0.5, line_dash="dash", line_color="gray")
fig.add_annotation(x=A, y=0, text=f"{A:.2f}", showarrow=False, yshift=14)
fig.update_layout(
    title="Common-language effect size: probability of superiority",
    xaxis=dict(range=[0, 1], title="P(A > B) + 0.5·P(tie)"),
    yaxis=dict(showticklabels=False),
    width=760,
    height=260,
    showlegend=False,
)
fig.show()
9) A cautionary example: “same median” does not imply “no difference”#
Mann–Whitney U is often described as a test for a median shift. That interpretation is most defensible when the two distributions have similar shapes.
Here we build two samples with exactly the same sample median, but different shapes. It’s still possible for Mann–Whitney U to be significant, because the test is about relative ordering, not “median equality”.
rng_shape = np.random.default_rng(7)
n = 200
x = rng_shape.normal(0, 1, size=n)
y = rng_shape.exponential(scale=1.0, size=n) - np.log(2)
y = y - np.median(y) + np.median(x)
U1_xy, z_xy, p_xy = mw_u_normal_approx(x, y, alternative="two-sided", continuity=True)
{
    "sample_median_x": float(np.median(x)),
    "sample_median_y": float(np.median(y)),
    "p_value (normal approx)": p_xy,
}
{'sample_median_x': -0.1307394221048977,
'sample_median_y': -0.1307394221048977,
'p_value (normal approx)': 0.010240851796336825}
fig = go.Figure()
fig.add_trace(
    go.Violin(
        y=x,
        name="x (Normal)",
        box_visible=True,
        meanline_visible=True,
        points=False,
        line=dict(color="#636EFA"),
        fillcolor="rgba(99,110,250,0.35)",
    )
)
fig.add_trace(
    go.Violin(
        y=y,
        name="y (Shifted Exponential)",
        box_visible=True,
        meanline_visible=True,
        points=False,
        line=dict(color="#EF553B"),
        fillcolor="rgba(239,85,59,0.35)",
    )
)
fig.update_layout(
    title=f"Different shapes with the same sample median (p ≈ {p_xy:.4f})",
    yaxis_title="Value",
    width=760,
    height=450,
)
fig.show()
10) Pitfalls + diagnostics#
Independence is crucial: if measurements are paired/repeated, switch tests.
Ties: common in discrete/rounded data. Average ranks + tie-corrected variance help, but consider a permutation test.
Report an effect size: a significant p-value can still correspond to a small \(A\) (small practical difference).
Distribution shape matters: if spreads/skews differ a lot, interpret the result as “distribution difference”, not “median difference”.
Power: like any test, Mann–Whitney can miss small effects with small samples.
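The power caveat is easy to demonstrate with a small simulation. This is a minimal sketch, assuming SciPy is available; the lognormal parameters mirror the example data, and `power_estimate` is a hypothetical helper, not part of the tutorial's API:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def power_estimate(shift, n=30, alpha=0.05, n_sims=400):
    """Fraction of simulated experiments where the two-sided test rejects H0."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.lognormal(mean=0.0, sigma=0.55, size=n)
        b = rng.lognormal(mean=-shift, sigma=0.55, size=n)
        if mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
            rejections += 1
    return rejections / n_sims

# With n = 30 per group, a small shift is often missed; a larger one is not.
for shift in (0.0, 0.2, 0.5):
    print(f"shift={shift:.1f}  power~{power_estimate(shift):.2f}")
```

At `shift=0.0` the rejection rate is just the false-positive rate (about \(\alpha\)); as the shift grows, power climbs toward 1.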
11) Practical usage (SciPy)#
In real projects, you’ll usually call a library implementation.
SciPy’s mannwhitneyu returns the U statistic (for the first sample you pass) and a p-value.
Below is an optional sanity check against SciPy’s asymptotic (normal-approx) result.
try:
    from scipy.stats import mannwhitneyu

    res = mannwhitneyu(group_a, group_b, alternative="two-sided", method="asymptotic")
    print({
        "ours_U1": float(U1_norm),
        "ours_p_value": float(p_norm),
        "scipy_U1": float(res.statistic),
        "scipy_p_value": float(res.pvalue),
    })
except Exception as e:
    print(f"SciPy check skipped: {e}")
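For small samples, SciPy can also compute the p-value from the exact null distribution of \(U\) via method="exact". A quick sketch on fresh continuous data (so ties are absent; the samples here are arbitrary):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
a = rng.normal(0.5, 1.0, size=8)
b = rng.normal(0.0, 1.0, size=9)

exact = mannwhitneyu(a, b, alternative="two-sided", method="exact")
asymptotic = mannwhitneyu(a, b, alternative="two-sided", method="asymptotic")
print(exact.pvalue, asymptotic.pvalue)  # typically close, even at these sizes
```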
Exercises#
Implement \(U_1\) using pairwise comparisons and show it matches the rank-based formula (including ties).
Use the permutation approach to approximate a p-value for different sample sizes and see how it stabilizes as n_permutations increases.
Simulate power: pick a small shift in location and estimate how often Mann–Whitney rejects at \(\alpha=0.05\).
References#
SciPy documentation: scipy.stats.mannwhitneyu
Conover, Practical Nonparametric Statistics
Lehmann, Nonparametrics: Statistical Methods Based on Ranks